ABSTRACT

Density-based cluster mining is known to serve a broad range 
of applications ranging from stock trade analysis to moving 
object monitoring. Although methods for efficient extrac- 
tion of density-based clusters have been studied in the lit- 
erature, the problem of summarizing and matching of such 
clusters with arbitrary shapes and complex cluster struc- 
tures remains unsolved. Therefore, the goal of our work is 
to extend the state-of-the-art of density-based cluster mining in
streams from cluster extraction only to now also support 
analysis and management of the extracted clusters. Our 
work solves three major technical challenges. First, we pro- 
pose a novel multi-resolution cluster summarization method, 
called Skeletal Grid Summarization (SGS), which captures 
the key features of density-based clusters, covering both 
their external shape and internal cluster structures. Second, 
in order to summarize the extracted clusters in real-time, we 
present an integrated computation strategy C-SGS, which 
piggybacks the generation of cluster summarizations within 
the online clustering process. Lastly, we design a mecha- 
nism to efficiently execute cluster matching queries, which 
identify similar clusters for a given cluster of the analyst's interest
from clusters extracted earlier in the stream history. Our ex- 
perimental study using real streaming data shows the clear 
superiority of our proposed methods in both efficiency and 
effectiveness for cluster summarization and cluster matching 
queries over other potential alternatives.

1. INTRODUCTION 

Motivation. Mining complex patterns such as clusters 
and graphs from huge volumes of streaming data has been 
recognized as critical for numerous application domains. To 
facilitate such complex pattern mining process, a streaming 
pattern mining system not only needs to be equipped
with highly efficient pattern extraction algorithms, but more 
importantly, it must also provide effective pattern analysis 
support, as motivated below: 

1) Pattern feature abstraction. The key features of 
detected patterns may be complex and thus may not be 
easily comprehensible for human analysts without analytical 
assistance. For example, in real-time traffic monitoring, a 
cluster representing a congestion area in the traffic of Beijing 
may be composed of 10K or even more vehicles and may
spread to over 10 km². By simply looking at the information
about individual cluster members (vehicles), such as their 
positions and moving speed, an analyst may not be able to 
identify the key features of this cluster in real time, such as 
where the key bottleneck causing the congestion is.

2) Pattern compression. Some patterns need to be 
kept for long-term analysis, yet keeping the full represen- 
tation of the complex patterns tends to be impractical in 
streaming environments. In the previous example, storing 
the full representation of the detected traffic congestion pat- 
terns (arbitrarily shaped clusters), namely the individual 
cluster member tuples (tens of thousands of tuples for each
cluster) would cause not only a huge burden on the storage 
space but also low efficiency for pattern transmission. 

3) Pattern retrieval (matching). For stream anal- 
ysis, the archived patterns may need to be retrieved based 
on their features. Using the above example, when a new 
traffic congestion arises, the analysts may ask whether sim- 
ilar congestion patterns have been detected before. If yes, 
rather than figuring out a new congestion-relief plan from 
scratch, the previous proven-to-work solution for such con- 
gestion patterns could be directly applied. 

In short, an effective pattern summarization method is 
the key for complex pattern analysis and management. It 
is needed for many different aspects of pattern analysis, 
including feature abstraction, compression and pattern
retrieval (as mentioned above). Pattern summarizations
can also be used for approximate pattern representation.
For example, one can design pattern visualization
or full representation re-generation techniques based
on pattern summarizations. In this work, our goal is to
design effective summarization and matching techniques
for density-based clusters in streaming environments, which
remain open problems for the database community.

Sliding Window Semantics. In this work, we focus 
on density-based cluster mining in sliding stream windows 
[7, 8, 16, 17]. In this query semantics, arbitrarily shaped 
clusters are continuously detected within the most recent
portion of the stream. The traffic congestion monitoring 
task discussed above is an example that requires such query 
semantics. Other applications that require such query se- 
mantics include detecting intensive-transaction areas (clus- 
ters) in most recent stock trades, and identifying malicious 
attacks (clusters) in current network traffic.

Challenges. Summarization and matching of density- 
based clusters is not only an unsolved but also a challenging 
problem. To serve real-time streaming applications, the pro- 
posed techniques must address the following challenges: 1) 
Cluster summarization must be sufficiently descriptive yet 
highly compact. The cluster structure of a density-based
cluster is defined by a series of densely populated sub-regions
as well as the connections among them (see Figure 1).
Clearly, simple statistical aggregations, such as the centroid 
or minimum bounding rectangle of a cluster, are insufficient 
for describing such complex pattern structure. 2) The clus- 
ter summarization process has to be highly efficient. A 
system conducting expensive online clustering can hardly af- 
ford additional system resources for summarizing clusters in 
real-time. 3) The summarized cluster representation needs
to be effectively retrievable ("matchable"). The matching
process between cluster summarizations ought to faithfully
reflect the similarity between the original clusters, yet be
computationally efficient. 

Proposed Solution. To address the above challenges, 
we first analyze density-based cluster structures and iden- 
tify their key characteristics, namely position, shape, connec- 
tivity and density distribution. To capture these features, 
we investigate two commonly-used summarization principles,
namely the graph-based and the grid-based strategies.
We discover that neither of them alone is capable of
providing an effective summarization for density-based clusters.
Therefore, we propose a hybrid solution, called Skeletal Grid 
Summarization (SGS). For descriptive power, SGS is shown 
to guarantee its fidelity to the original clusters on all key 
features. For compactness, our experimental study in Sec- 
tion 8 confirms that even the SGS of the highest resolution 
achieves on average a 98% compression rate of the full rep- 
resentation of the clusters. 

Empowered by the proposed SGS summarization, we design
a framework to support both continuous cluster extraction
and cluster matching queries. A continuous cluster
extraction query in our system not only extracts clusters
in their full representation (all cluster member objects)
for online monitoring purposes like the other state-of-the-art
techniques [3, 16], but also concurrently compacts them
into the SGS summarization. The full and the summarized
(SGS) representation formats are complementary to each 
other, providing a description of the clusters at the individ- 
ual tuple and cluster feature level respectively. To extract 
these two representation formats simultaneously and in a 
highly efficient manner, we propose an integrated cluster 
extraction + summarization algorithm, C-SGS. C-SGS
incrementally maintains both the full representation and the
corresponding SGS of the extracted clusters in an integrated
manner. This results in an almost "free" cluster summarization
generation by piggybacking the summarization process
onto the cluster extraction process itself. Our experimental
study in Section 8 shows that C-SGS, which returns clusters
in both full and summarized representation (SGS), has
a negligible overhead compared with the state-of-the-art
algorithm Extra-N [16], which computes the full representation
of clusters only. In all our test cases, the extra response time
of C-SGS compared with Extra-N is consistently less than
6% (Section 8.1).

For any "to-be-matched" cluster specified by the analyst, 
a cluster matching query identifies similar clusters extracted
earlier in the same stream from a pattern archive. To support
such queries, our framework first archives the SGS of
the extracted clusters into a pattern archive. When executing
a cluster matching query, our system deploys a
filter-and-refine strategy. First, the filter-phase exploits a
feature index to locate the potential matching candidates from
the pattern store. Then, the refine-phase conducts a more 
detailed cluster match against these promising candidates 
and returns those with similarity above a given threshold. 
Our experimental study shows that, efficiency-wise, our system
takes only 3 seconds on average to answer a cluster
matching query against 10K archived clusters (Section 8.2).
Quality-wise, our user study, which invites human analysts 
to visually compare the similarity between matched clusters,
shows that human analysts agree with a significantly
larger percentage of the matched clusters found using our
proposed matching mechanism compared to those found by 
alternatives (Section 8.3). 

Contributions. The main contributions of this work include:
1) We propose the first summarization method specifically
designed for density-based clusters, namely the Skeletal
Grid Summarization (SGS). 2) We present an integrated
cluster mining and summarization algorithm, C-SGS, which
efficiently computes the full representation and the SGS of
the extracted clusters in one shot. 3) We develop a cluster
matching mechanism based on SGS to efficiently process
cluster matching queries in real-time. 4) Our performance
evaluation and user study using real streaming data confirm 
that our proposed techniques are clearly superior to other al- 
ternatives in all aspects, including summarization efficiency, 
cluster matching efficiency and matching quality. 

2. RELATED WORK 

The concept of density-based clustering was first proposed 
in [8]. It has drawn significant research attention [7, 16, 
17, 12, 3, 4], because of its capability of identifying clus- 
ters with arbitrary shapes and specified density. Previous 
work mainly studied how to efficiently extract such clusters 
in static [8, 7, 12] or streaming environments [16, 17, 3, 
4]. Also, given the prevalence of real-time monitoring tasks 
in stream applications, researchers have started to design 
visual platforms allowing human analysts to interactively 
explore such patterns in streams [14]. 

However, the fundamental problem of summarizing this 
important pattern type has not been studied in the litera- 
ture yet. Without an effective yet compact summarization 
method, each density-based cluster has to be expressed by 
its full representation, namely its cluster member objects.
Obviously, such full representation is neither succinct nor 
does it explicitly reflect the features of each cluster. This 
causes serious inconvenience for both storage and analysis 
of density-based clusters. 

Traditional clustering methods [10, 19], such as k-means
style clustering, treat clusters as statistical phenomena.
Therefore, many key features of the clusters, such as their
shapes and densities, are summarized using a rather simplistic
description. In particular, first, these works assume clusters
are spherically shaped. Therefore, the shape of a cluster
is usually described using a simple "centroid + radius"
formula. Second, these works do not capture the internal
features of the clusters, such as how their density is
distributed. For example, the density of a cluster is either
treated as uniform or varying along the radius only. Obviously,
such a simple formula cannot adequately describe the complex
cluster structure of density-based clusters. This is because
both the shapes and density distributions of density-based
clusters can be arbitrary, not to mention the complex
sub-region connectivities in each cluster. To the best of our
knowledge, no summarization method has been specifically
designed for density-based clusters.

For computing cluster summarizations in streaming envi- 
ronments, if the clusters are treated as statistical phenom- 
ena, they are considered to be "aggregatable" over time [1, 
5]. For example, [1] used one Cluster Feature Vector (CFV) 
to represent each micro-cluster detected in the stream. They 
rely on the additivity property of the CFV to aggregate the 
cluster features over time and compare the features of the same
cluster at different time points by subtracting its CFVs on 
the corresponding time points. 
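
The additivity argument above can be illustrated with a small Python sketch. The class below is a deliberately simplified stand-in, not the exact CFV layout of [1]: it assumes a feature vector holding only an object count, a per-dimension linear sum and a per-dimension sum of squares, and the names `CFV`, `of` and `centroid` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFV:
    """Simplified cluster feature vector (an assumption, not the layout of [1])."""
    n: int      # number of objects absorbed so far
    ls: tuple   # per-dimension linear sum
    ss: tuple   # per-dimension sum of squares

    @staticmethod
    def of(point):
        """Feature vector of a single object."""
        return CFV(1, tuple(point), tuple(x * x for x in point))

    def __add__(self, other):
        # Additivity: the features of a union are the component-wise sums.
        return CFV(self.n + other.n,
                   tuple(a + b for a, b in zip(self.ls, other.ls)),
                   tuple(a + b for a, b in zip(self.ss, other.ss)))

    def __sub__(self, other):
        # Subtracting an older snapshot isolates what arrived in between.
        return CFV(self.n - other.n,
                   tuple(a - b for a, b in zip(self.ls, other.ls)),
                   tuple(a - b for a, b in zip(self.ss, other.ss)))

    def centroid(self):
        return tuple(s / self.n for s in self.ls)


snapshot_t1 = CFV.of((1.0, 2.0)) + CFV.of((3.0, 4.0))
snapshot_t2 = snapshot_t1 + CFV.of((5.0, 6.0))
recent = snapshot_t2 - snapshot_t1   # features of objects that arrived after t1
```

Such a vector describes a cluster only through aggregates (count, mean, spread), which is why it cannot express merges, splits or connectivity changes.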

However, the complex cluster structure of density-based 
clusters is not simply aggregatable over the sliding windows. 
The continuous expiration of old objects and arrival of new 
objects at each window may cause complex cluster structural 
changes, such as merges, splits and connectivity changes
within the clusters. Clearly, these changes cannot be sim- 
ply captured by aggregation results. Thus, these techniques 
cannot effectively capture the features of density-based clus- 
ters within sliding windows. 

3. PRELIMINARIES 

3.1 Density-Based Clustering in Windows 

Density-based cluster detection [8, 7] uses a range threshold
$\theta^r \geq 0$ to define the neighbor relationship between
objects. For two objects $p_i$ and $p_j$, if the distance between
them is no larger than $\theta^r$, $p_i$ and $p_j$ are said to be
neighbors. We use the function $NumNeigh(p_i, \theta^r)$ to denote the
number of neighbors an object $p_i$ has, given the $\theta^r$ threshold.

Definition 3.1. Density-Based Cluster: Given $\theta^r$ and
a count threshold $\theta^c$, an object $p_i$ with
$NumNeigh(p_i, \theta^r) \geq \theta^c$ is defined as a core point.
Otherwise, if $p_i$ is a neighbor of any core object, $p_i$ is an
edge point. $p_i$ is a noise point if it is neither a core object
nor an edge object. Two core objects $p_0$ and $p_n$ are connected
if they are neighbors of each other, or there exists a sequence
of core points $p_0, p_1, \ldots, p_{n-1}, p_n$, where for any $i$
with $0 \leq i \leq n-1$, each pair of core points $p_i$ and
$p_{i+1}$ are neighbors of each other. Finally, a density-based
cluster is defined as a maximum group of "connected core objects"
and the edge objects attached to them. Any pair of core objects
within a cluster are "connected".
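
As an illustration of Definition 3.1, the following brute-force Python sketch labels each object as core, edge or noise. It is a minimal example under stated assumptions, not the extraction algorithm used in this work: the names `num_neigh` and `classify` are hypothetical, Euclidean distance is assumed, and an object is assumed to count as its own neighbor.

```python
import math


def num_neigh(points, i, theta_r):
    """Number of objects within distance theta_r of points[i].

    Assumption: an object counts as its own neighbor.
    """
    return sum(1 for q in points if math.dist(points[i], q) <= theta_r)


def classify(points, theta_r, theta_c):
    """Label every object as 'core', 'edge' or 'noise' (Definition 3.1)."""
    core = {i for i in range(len(points))
            if num_neigh(points, i, theta_r) >= theta_c}
    labels = []
    for i, p in enumerate(points):
        if i in core:
            labels.append("core")
        elif any(math.dist(p, points[j]) <= theta_r for j in core):
            labels.append("edge")   # not core, but a neighbor of a core object
        else:
            labels.append("noise")
    return labels


points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.8, 0.0), (5.0, 5.0)]
labels = classify(points, theta_r=1.0, theta_c=3)
```

With these inputs the first three objects are core, the fourth is an edge object attached to the third, and the last is noise.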

Figure 1 shows an example of a density-based cluster com- 
posed of 11 core objects (black) and 24 edge points (grey). 

Figure 1: Definition of Density-Based Clusters

We focus on periodic sliding window semantics as proposed
in CQL [2] and widely used in the literature [16, 17].
These semantics can be either time- or count-based.
Each query has a window with a fixed window size
win and a fixed slide size slide (either a time interval or a
tuple count). Clusters are generated for each window $W_i$
based only on those data points that fall into the same
window $W_i$. Each cluster is returned as all its cluster member
objects associated with the same cluster identification. We
call this typical output format the full representation of
each cluster.
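
The count-based variant of these window semantics can be sketched in a few lines of Python. This is only an illustration: the function name `count_based_windows` is hypothetical, and the sketch assumes that the first window is reported as soon as win tuples have arrived and that one window is reported every slide tuples thereafter.

```python
from collections import deque


def count_based_windows(stream, win, slide):
    """Yield the content of each count-based window W_i as a list."""
    window = deque(maxlen=win)     # the oldest tuple expires automatically
    since_last_output = 0
    for tup in stream:
        window.append(tup)
        since_last_output += 1
        if len(window) == win and since_last_output >= slide:
            since_last_output = 0
            yield list(window)


windows = list(count_based_windows(range(8), win=4, slide=2))
```

Each yielded list is the input on which clustering for that window would be computed; consecutive windows overlap in win - slide tuples, which is what makes incremental maintenance attractive.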

3.2 Supported Queries and System Overview 

Our system supports two types of analytical queries:

Continuous Clustering Queries. A Continuous Clustering
Query returns both the full (Section 3.1) and the summarized
representation of the extracted clusters (Figure 2). The design
of our proposed cluster summarization format will be
introduced in Section 4.



DETECT DensityBasedCluster^{f+s} FROM stream
USING \theta^{range} = r and \theta^{cnt} = c
IN Windows WITH win = w and slide = s



Figure 2: Continuous Cluster Extraction Query Re- 
turning full (f) and summarized (s) representations 
of clusters 

Cluster Matching Queries. Given a user-specified
to-be-matched cluster $C_i$, a cluster matching query finds
clusters similar to $C_i$ that reside in the historical pattern
archive. We show a template of such a query in Figure 3.



GIVEN DensityBasedCluster^{s} C_i
SELECT DensityBasedCluster^{s} C_j FROM History
WHERE Distance(C_i, C_j) < sim_threshold



Figure 3: Cluster Matching Query finding Clusters 
Similar to To-Be-Matched Cluster Based on Cluster 
Summarization 

The to-be-matched cluster can be any cluster specified by 
an analyst. Typically, it may be a cluster detected in the 
most recent portion of the stream that represents the newest
characteristics of the stream. The matched clusters, if any,
will be found in the historical pattern store, which archives
the clusters extracted by the Continuous Clustering Query
earlier in the stream.

3.3 System Overview 

To support these two types of analytical queries, we design 
a framework composed of four major components (Figure 4).
Here we give a brief overview of the functionalities of each
component, while in-depth technical details are discussed
later in Sections 5 to 7.

Figure 4: System Overview

The Pattern Extractor executes the Continuous Cluster 
Extraction Query (Figure 2) against the input stream. It 
outputs both full and summarized representations of the 
extracted clusters. Both representations are returned to the 
analyst for real-time monitoring. Meanwhile, the extracted 
clusters are also passed to the Pattern Archiver for storage, 
and to the Pattern Analyzer for cluster matching.

The Pattern Archiver selectively archives the newly de- 
tected clusters into the Pattern Base. These archived clus- 
ters constitute the Stream History available for subsequent 
Cluster Matching Queries (Figure 3). The Pattern Archiver
controls which extracted clusters should be kept in the Pat- 
tern Base and at which resolution they should be archived. 

The Pattern Base organizes the archived clusters. To 
facilitate cluster matching against historical clusters, it em- 
ploys multiple feature indices to organize the archived clus- 
ters. This helps the Cluster Matching Queries to quickly 
locate the potential matching candidates. 

The Pattern Analyzer executes the Cluster Matching 
Queries (Figure 3). If an analyst is interested in any newly 
extracted cluster and would like to learn whether similar 
clusters had been detected before in the Stream History, 
she can submit her Cluster Matching Query to the Pattern 
Analyzer to search for matches against the Pattern Base. 

4. CLUSTER SUMMARIZATION 
4.1 Features of Density-Based Clusters 

Based on our analysis, we identify four key features that 
define each density-based cluster, which can be divided into 
two categories, namely external and internal features. 
External Features: 

Location: The location of a cluster indicates its position 
in the data space. It provides basic information about each 
cluster, such as where a congestion area (a cluster) arises in 
the traffic, or in which price range an intensive-transaction 
area, a cluster based on price, volume and transaction time, 
is detected in the stock transaction stream. 

Shape: Density-based clusters can have arbitrary shapes. 
The shape is a key feature, because a certain shape of the 
cluster may convey specific meaning for an application. For 
example, for the clusters representing intensive-transaction 
areas in stock transactions, a cluster having a long spread 
on transaction price but short range on transaction time 
conveys that a large number of transactions of a certain
stock happened in a short time period while its price
fluctuated dramatically within this time period.
Internal Features: 

Connectivity: The connectivity of a density-based clus- 
ter describes how sub-regions within the cluster are con- 
nected. It is important for density-based clusters for both 
definition and application reasons. First, it defines the internal
structure of each cluster. The definition of the density-based
cluster (see Section 3.1) relies on the connectivities among
sub-regions to define a cluster. Second, the connectivities
among sub-regions may be relevant to applications. For ex- 
ample, if two sub-regions within a single cluster representing 
a group of moving troops are not directly connected, then 
this may indicate the units in these two sub-regions cannot 
directly communicate with each other, because there are no 
connected "Head Nodes" (core objects) in these two sub- 
regions of their wireless network. 

Density Distribution: Although the definition of density- 
based clusters imposes a minimal density requirement on ob- 
jects in a cluster, the density of each cluster can be rather di- 
verse across its sub-regions. The density distribution within 
each cluster may be of interest to an analyst in many applications.
Using the earlier example, even in a single congestion
area, the level of congestion (density of vehicles) may vary 
among sub-regions. Therefore, the density distribution in 
each sub-region may be the key for working out a conges- 
tion relief plan, as the super dense sub-regions may be the 
areas that cause the congestion. 

4.2 Initial Effort: Graph-Based Summarization Method

Any effective summarized representation for density-based 
clusters has to capture the above four key features (Section 
4.1). Given that density-based clusters may vary arbitrarily 
in shape, connectivity and also density distributions, us- 
ing any aggregative method to represent these features will 
have rather poor descriptive power. Therefore, we propose 
to leverage an alternative strategy, namely the divide-and- 
conquer approach. We divide each cluster into sub-regions, 
and then we describe not only the features in each sub-region 
but also the interrelationships among the sub-regions. 

Given this divide-and-conquer strategy, we first introduce 
a possible summarization method based on graph theory. 
This method uses one representative object to represent each 
sub-region. We call it "Skeletal Point Summarization" (SkPS):

Definition 4.1. For each cluster $C_i$, the SkPS summarization
of $C_i$ is a graph $G(V, E)$ composed of a minimal set
of connected core objects of $C_i$, called Skeletal Points, as
vertices $V$, whose neighborhoods together cover all the objects
in this cluster, and the connections among them as edges $E$.

The graph composed of all core objects in Figure 1 is an ex- 
ample for SkPS. SkPS captures most of the cluster features 
and also has good compactness. However, it suffers from 
several serious shortcomings. First, SkPS has limited de- 
scriptive power for a cluster's density distribution. Second, 
such SkPS is not efficiently computable. For each cluster,
identifying its SkPS is equivalent to the problem of identifying
a minimum connected dominating set in an undirected graph, which
has been proven to be NP-complete [9]. Third, SkPS is not
a viable solution for matching, because a single cluster may 
have multiple SkPSs with rather different graph structures. 
Based on our analysis, these limitations suffered by SkPS are 
caused by its overlapping and non-deterministic sub-region
division strategy. In conclusion, SkPS does not constitute
an ideal summarization for density-based clusters. A more
detailed discussion of the SkPS method can be found in our
technical report [18]. 

4.3 Proposed Solution: Skeletal Grid Summarization Method

Basics of Grid-Based Summarization. To overcome the
limitations suffered by SkPS, we propose to adapt SkPS by
dividing each cluster into non-overlapping sub-regions. In
particular, we divide the whole data space into uniformly
sized grid cells. For each cluster, its sub-region division is
now determined by the grid cells into which its members fall.
Therefore, a cluster $C_i$ can be represented by all the grid
cells containing at least one of $C_i$'s cluster member objects.

Connectivity Preservation. However, this simplistic
grid-based summarization lacks one key capability of the
SkPS solution, namely it does not capture the connectivity
within clusters. In SkPS, both the inner and inter sub-region
connectivity information of each cluster is well preserved.
First, each sub-region in SkPS itself is "well connected", as
all objects in a sub-region are neighbors of the same skeletal
point. Second, the inter connections among different
sub-regions are explicitly expressed by the "edges" in SkPS.
In contrast, this simplistic grid-based summarization preserves
neither of these two types of connectivity information.

Connectivities In Grid Cells. To solve this problem, 
we propose to integrate the concept of "connectivities" into 
the grid-based solution. As foundation, we first introduce 
the concept of status to a grid cell. We divide the grid cells 
in each cluster's summarization into two categories, namely
"core cells" and "edge cells".

Definition 4.2. Core cells: a core cell of a cluster $C_i$
contains at least one core object (see Def. 3.1) of $C_i$.

Edge cells: an edge cell of a cluster $C_i$ contains no core
object, but at least one edge object (see Def. 3.1) of $C_i$.

Noise cells: a noise cell contains neither core nor edge
objects of any cluster. ^1

For inner-sub-region connections, we follow the basic 
principle for the sub-region division strategy, which is to 
pursue homogeneity in each sub-region. In particular, we 
pick a fine grid size to guarantee that the objects that fall 
into the same grid cell are neighbors of each other. More 
precisely, the diagonal of each grid cell is set to be equal to the
range threshold $\theta^r$ in the given clustering query (see Section
3.1). This grid cell size selection will be relaxed later in 
our discussion of the multi-resolution cluster summarization 
(Section 6). Under this fine grid size selection, the core and 
edge cells can be shown to have the following properties. 

Lemma 4.1. All objects in a core cell belong to the same 
cluster. 

Proof: Since each core cell contains at least one core 
object and all the objects in each core cell are now neighbors 
of each other, it implies that all objects in the same core cell 
are neighbors of at least one common core object. Based 
on the definition of density-based cluster (see Def. 3.1), the 
neighbors of a core object belong to the same cluster. ■ 

^1 Noise cells are only used in the cluster computation stage.



Lemma 4.2. The number of objects in an edge cell must
be less than the count threshold $\theta^c$ in the clustering query.

Proof: We prove this lemma by contradiction. Given
that all objects in a grid cell are neighbors of each other, if
there are at least $\theta^c$ objects in an edge cell, those objects
would be core objects, as they all have at least $\theta^c$ neighbors.
This contradicts the definition of an edge cell (Def. 4.2). ■
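
The fine grid size selection behind Lemmas 4.1 and 4.2 can be made concrete with a short Python sketch. It assumes a Euclidean data space with dims dimensions, in which a cell whose diagonal equals the range threshold has side length theta_r / sqrt(dims); the helper names `cell_side`, `cell_index` and `assign_to_cells` are hypothetical.

```python
import math
from collections import defaultdict


def cell_side(theta_r, dims):
    """Side length of a dims-dimensional cell whose diagonal equals theta_r."""
    return theta_r / math.sqrt(dims)


def cell_index(point, side):
    """Integer coordinates of the grid cell containing `point`."""
    return tuple(math.floor(x / side) for x in point)


def assign_to_cells(points, theta_r):
    """Group points by the grid cell they fall into."""
    side = cell_side(theta_r, len(points[0]))
    cells = defaultdict(list)
    for p in points:
        cells[cell_index(p, side)].append(p)
    return dict(cells)


points = [(0.1, 0.1), (0.6, 0.6), (0.8, 0.1), (2.0, 2.0)]
cells = assign_to_cells(points, theta_r=1.0)
```

Because no two points of a cell can be farther apart than its diagonal, any two objects that land in the same cell are neighbors, which is the property both proofs rely on.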

Given these properties, each grid cell is "well-connected" 
and constitutes a basic unit for the inter-grid connection 
expression, as defined below. 

For the inter-sub-region connection, we now define 
the "connections" between grid cells. 

Definition 4.3. Two core cells $ccl_1$ and $ccl_2$ are directly
connected if there exists at least one core object $p_i$ in $ccl_1$
and one core object $p_j$ in $ccl_2$ that are neighbors of each
other. Two core cells $ccl_0$ and $ccl_n$ are connected if they
are directly connected to each other, or there exists a sequence
of core cells $ccl_0, ccl_1, \ldots, ccl_{n-1}, ccl_n$, where for any $i$
with $0 \leq i \leq n-1$, each pair of core cells $ccl_i$ and $ccl_{i+1}$
are directly connected with each other.

An edge cell $ecl_i$ is attached to a core cell $ccl_j$ if there
exists at least one object $p_i$ in $ecl_i$ and one core object $p_j$ in
$ccl_j$ that are neighbors of each other.

Two edge cells are neither connected nor attached.

Given the connection definition for grid cells above, all
core cells of a cluster $C_i$ are connected to each other, and
all edge cells are attached to at least one core cell of $C_i$.
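
The cell status and connection rules of Definitions 4.2 and 4.3 can be sketched as follows. This is an illustrative Python fragment under assumptions: each cell is represented as a list of (point, label) pairs with labels "core", "edge" or "noise", Euclidean distance is used, and the function names `cell_status` and `linked` are hypothetical.

```python
import math


def cell_status(labels_in_cell):
    """Status of a grid cell given the labels of its objects (Def. 4.2)."""
    if "core" in labels_in_cell:
        return "core"
    if "edge" in labels_in_cell:
        return "edge"
    return "noise"


def linked(cell_a, cell_b, theta_r):
    """True if core cell cell_a is directly connected to core cell cell_b,
    or if edge cell cell_b is attached to core cell cell_a (Def. 4.3)."""
    status_a = cell_status([label for _, label in cell_a])
    status_b = cell_status([label for _, label in cell_b])
    if status_a != "core" or status_b == "noise":
        return False
    cores_a = [p for p, label in cell_a if label == "core"]
    # In a core cell only core objects count; in an edge cell any object does.
    candidates_b = [p for p, label in cell_b
                    if status_b == "edge" or label == "core"]
    return any(math.dist(p, q) <= theta_r
               for p in cores_a for q in candidates_b)


cell_a = [((0.1, 0.1), "core"), ((0.3, 0.2), "edge")]
cell_b = [((0.9, 0.1), "core")]
cell_c = [((0.2, 0.9), "edge")]
```

Here `linked(cell_a, cell_b, 1.0)` expresses a direct connection between two core cells and `linked(cell_a, cell_c, 1.0)` an attachment; calling it with an edge cell as the first argument always yields False, mirroring the rule that edge cells are neither connected nor attached to each other.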

Skeletal Grid Summarization. Based on the status 
and connections of grid cells, we now give the definition of 
our proposed Skeletal Grid Summarization (SGS) method.

Definition 4.4. A Skeletal Grid Summarization
(SGS) of a density-based cluster $C_i$ is composed of all grid
cells that contain at least one cluster member object of $C_i$.
We call each grid cell in an SGS a Skeletal Grid Cell
($Sc$) of $C_i$. $SGS = \{Sc_0, Sc_1, \ldots, Sc_n\}$. Each $Sc_i$ has five
attributes, namely $Sc_i$ =
(location[], sidelength, population, status, connection[]).

1) location vector: a sequence of values, each indicating
the minimum value on one of the dimensions covered by $Sc_i$.

2) side length: the range of values on each dimension.

3) population: the number of objects contained by $Sc_i$.

4) status: whether $Sc_i$ is a core or edge cell.

5) connection vector: a sequence of boolean connection
indicators, each indicating $Sc_i$'s connection to one of its
adjacent skeletal grid cells. For any edge or noise cell, all
connection indicators are "false". For any core cell, a connection
indicator is "true" if the corresponding adjacent skeletal grid
cell $Sc_j$ is a core cell and $Sc_i$ and $Sc_j$ are directly connected,
or if $Sc_j$ is an edge cell attached to $Sc_i$.
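
One possible in-memory layout for the five attributes of Definition 4.4 is sketched below in Python. The class name `SkeletalGridCell`, the function `build_sgs`, the dictionary-based connection vector and the input format (a mapping from integer cell index to a list of (point, label) pairs) are all assumptions made for illustration. The connection indicators are left empty here; they would be filled according to Definition 4.3.

```python
from dataclasses import dataclass, field


@dataclass
class SkeletalGridCell:
    location: tuple       # minimum value on each dimension covered by the cell
    side_length: float    # range of values on each dimension
    population: int       # number of objects contained in the cell
    status: str           # "core" or "edge"
    # Assumed encoding: location of an adjacent skeletal grid cell -> indicator.
    connection: dict = field(default_factory=dict)


def build_sgs(cells, side_length):
    """Build the list of skeletal grid cells from labeled grid cells."""
    sgs = []
    for index, members in sorted(cells.items()):
        labels = [label for _, label in members]
        if "core" in labels:
            status = "core"
        elif "edge" in labels:
            status = "edge"
        else:
            continue      # noise cells are not part of the summarization
        sgs.append(SkeletalGridCell(
            location=tuple(i * side_length for i in index),
            side_length=side_length,
            population=len(members),
            status=status))
    return sgs


cells = {
    (0, 0): [((0.1, 0.1), "core"), ((0.3, 0.2), "edge")],
    (1, 0): [((0.6, 0.1), "edge")],
    (3, 3): [((1.6, 1.7), "noise")],
}
sgs = build_sgs(cells, side_length=0.5)
```

The summary stores one small record per occupied cell regardless of how many objects the cell holds, which is where the compactness of SGS comes from.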

Figure 5 shows an example of our proposed Skeletal Grid 
Summarization (SGS) for a 2D cluster. SGS achieves our 
goal of preserving all four features, as shown below. 

Lemma 4.3. Fidelity to Location and Shape: The
data space covered by $C_i.SGS$ is larger than that covered
by the cluster member objects of $C_i$ by a bounded error.
Namely, any point in the data space covered by $C_i.SGS$ is
at most $\theta^r$ away from a cluster member object in $C_i$.

Proof: The data space covered by $C_i.SGS$ is composed
of the union of the space covered by all its skeletal grid cells.
Since all member objects of $C_i$ fall into these grid cells, the
data space covered by $C_i.SGS$ is larger than that covered by
$C_i$'s member objects. Since each skeletal grid cell in $C_i.SGS$
contains at least one member of $C_i$, and the diagonal of each
cell is $\theta^r$, any point in the data space covered by a skeletal
grid cell is at most $\theta^r$ away from a member of $C_i$. ■

Figure 5: Example of full representation, basic SGS
and compressed SGS of a 2D cluster

Lemma 4.4. Fidelity to Density Distribution: For
any sub-region in a cluster $C_i$ that is composed of $n$ ($n \geq 1$)
grid cells, $C_i.SGS$ can accurately express its density.

Proof: Since the skeletal grid cells in $C_i.SGS$ do not
overlap, the population recorded by each skeletal grid cell
accurately reflects the number of objects in it. Therefore, for
any sub-region covered by the $n$ skeletal grid cells belonging
to $C_i$, we can accurately calculate its density by dividing its
total population by its total volume. ■
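
The calculation in this proof can be written out explicitly. The symbols $d$ (dimensionality of the data space), $s$ (cell side length) and $R$ (the sub-region) are notation introduced here for illustration; the expression for $s$ assumes the fine grid of this section, in which the diagonal of every cell equals the range threshold.

```latex
% Sub-region R made up of the skeletal grid cells Sc_1, ..., Sc_n,
% each a d-dimensional cube with side length s and diagonal \theta^r.
\[
  s = \frac{\theta^r}{\sqrt{d}}, \qquad
  \mathrm{density}(R) \;=\; \frac{\sum_{k=1}^{n} Sc_k.\mathit{population}}{n \cdot s^{d}}
\]
```

Since the cells are disjoint and equally sized, the denominator is exactly the volume of $R$, so no object is counted twice and no volume is counted twice.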

Lemma 4.5. Fidelity to Connectivity: If there are two
sub-regions in $C_i$ connected through a connected core object
path composed of $n$ core objects, there must exist a core grid
path connecting these two sub-regions with at most $n$ core
cells on this path.

Proof: Since any skeletal grid cell containing a core
object is a core cell, if there exists a core object path between
two sub-regions, there must exist a core cell path between
them. In the worst case, each core cell on this core grid
path contains only one core object. Thus the length of the
core grid path is at most equal to the length of the core
object path. ■
In conclusion, SGS effectively captures all key features of 
density-based clusters using a compact description. 

5. PATTERN EXTRACTOR 

Next, we introduce the pattern extractor that executes 
the Continuous Clustering Query (Section 3.2), outputting 
clusters in both full and summarized (SGS) representations. 
To provide such functionalities, a straightforward approach 
would be a two-stage process, namely cluster extraction fol- 
lowed by summarization. However, this strategy causes a 
significant performance overhead compared to cluster ex- 
traction only. An in-depth analysis of such a two-phase 
strategy can be found in our technical report [18]. 

5.1 Proposed Solution: Integrated Process 

To solve this problem, we instead propose an integrated 
strategy that incorporates cluster extraction and summa- 
rization into a single process. The key observation that mo- 
tivates this integrated computation method is given below. 



Observation 5.1. The main tasks for both density-based 
cluster extraction and SGS computation are the same, namely 
to first identify the connections (neighborships) among the 
objects and analyze them to form the cluster structures (in 
either the full or a summarized representation). 

This observation reveals the key commonality among the 
cluster extraction and summarization processes. Based on 
it, we design an integrated extraction+summarization meth- 
od to effectively share the neighborship identification and 
cluster formation processes. 

5.2 Incremental Computation and Challenges 

To avoid conducting the prohibitively expensive clustering 
process from scratch at each window, our proposed method 
incrementally maintains the cluster structures across the 
windows. To realize incremental computation, we need to 
find an appropriate meta-data that can be maintained for 
both the full and summarized cluster representations. Our 
proposed solution is that, besides the raw data falling into 
each window, which needs to be maintained for cluster ex- 
traction in any case, we incrementally maintain the skeletal 
grid cells in the data space. With updated skeletal grid 
cells, we can easily output both the summarized and full 
representations of detected clusters. First, based on connec- 
tions among the skeletal grid cells, we can easily determine 
the summarized representation SGS (a group of connected
skeletal grid cells) for each cluster. Second, given the SGS of
a cluster $C_i$, $C_i.SGS$, we can determine the cluster member
objects of $C_i$ based on the objects falling into the respective
skeletal grid cells belonging to $C_i.SGS$.

However, incrementally maintaining skeletal grid cells in
an efficient manner is a challenging task. In particular, 
tracking the changes to the skeletal grid cells caused by ex- 
pired objects can be extremely expensive in terms of sys- 
tem resource utilization, and thus constitutes the key per- 
formance bottleneck for skeletal grid cell maintenance. 

When an object $p_{exp}$ expires, we need the connections
at the object level in order to update the connections among
the skeletal grid cells. For example, when $p_{exp}$ expires, we
first need to know which objects are neighbors of $p_{exp}$, as
their neighborships with $p_{exp}$ end from now on. This may
break the connections between the skeletal grid cell $Sc_i$ in
which $p_{exp}$ resides and those in which $p_{exp}$'s neighbors
reside. However, considering the large number of pair-wise
neighborships that may exist among the objects, maintaining
all of them has been shown, both analytically and
experimentally, to be extremely expensive in terms of system
resource utilization [16]. Therefore, the straightforward
incremental maintenance method, which updates skeletal grid
cells upon each insertion and deletion, is not practical.

5.3 "Lifespan" Analysis

To solve this computation bottleneck, we present a skele- 
tal grid cell maintenance method using "lifespan" analysis. 
This method elegantly eliminates the need for handling the 
impact of expired objects on the skeletal grid cells. The so- 
lution is based on the observation that in the sliding window 
semantics the lifespan of any object as well as the neighbor- 
ships among objects are deterministic. Therefore, at the 
insertion stage, when we handle the impact of new objects 
on the skeletal grid cells, we take the lifespans of the ob- 
jects into consideration. In particular, we pre-determine the 
changes that will happen to the skeletal grid cells when these
objects expire later. Then at the expiration stage, no further
update is needed to handle the impact of expired objects. 
Thus we avoid the bottleneck discussed above. 

Among the five attributes of a skeletal grid cell, location
and side length are fixed over time, while the other three,
namely population, status and connections, change over time
as objects come and go with each window slide. The
population of each skeletal grid cell is easily trackable with
a simple object counter. Thus, we focus on the lifespan
analysis of the status and the connections.

Basics of Lifespan Analysis. We start by analyzing the
lifespan of individual objects.

Observation 5.2. Given the slide size $Q.slide$ of a query
$Q$ and the starting time $W_n.T_{start}$ of the current window,
the lifespan of an object $p_i$ in $W_n$ with time stamp $p_i.T$
is $p_i.lifespan = \lceil (p_i.T - W_n.T_{start}) / Q.slide \rceil$,
indicating that $p_i$ will participate in windows $W_n$ to
$W_{n + p_i.lifespan - 1}$.

The number of windows that an object $p_i$ can survive in
is determined by how many window slides can occur while
$p_i$'s time stamp is still greater than the starting time of
the window. Based on the lifespan of individual objects, we
analyze the lifespan of the neighborship between two objects.

Observation 5.3. Given two objects $p_i$ and $p_j$, the
neighborship between them, $Neighbor(p_i, p_j)$, will hold for
$Neighbor(p_i, p_j).lifespan = Min(p_i.lifespan, p_j.lifespan)$
windows, namely, it will exist in all windows from $W_n$ to
$W_{n + Neighbor(p_i, p_j).lifespan - 1}$ until either $p_i$ or $p_j$ expires.
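
Observations 5.2 and 5.3 can be sketched in a few lines of Python. This is only a minimal illustration, assuming that an object's lifespan is the ceiling of the gap between its time stamp and the window start, divided by the slide size; the function names are hypothetical and not part of the original system.

```python
import math

def object_lifespan(timestamp, window_start, slide):
    """Number of windows, starting with the current one, that contain the object.

    Assumes window_start < timestamp, i.e. the object is in the current window.
    """
    return math.ceil((timestamp - window_start) / slide)

def neighborship_lifespan(lifespan_a, lifespan_b):
    """A neighborship lasts until the first of its two objects expires."""
    return min(lifespan_a, lifespan_b)

# Example: the current window starts at time 0 and slides by 10 time units.
a = object_lifespan(25, 0, 10)  # contained in the windows starting at 0, 10, 20
b = object_lifespan(8, 0, 10)   # contained only in the window starting at 0
shared = neighborship_lifespan(a, b)
```

Because both quantities depend only on time stamps and the slide size, they can be computed once, when an object arrives.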

Based on these observations, we can further analyze the
lifespan of different stages of an object's "career".

Observation 5.4. Given an object $p_i$ and all its neighbor
objects $p_{nb_1}$ to $p_{nb_k}$, the number of windows in which
$p_i$ will be a core object is
$p_i.core\_lifespan = Min(p_i.lifespan, win_{\theta^c\_nei})$,
with $win_{\theta^c\_nei}$ the number of windows in which
at least $\theta^c$ objects within $p_{nb_1}$ to $p_{nb_k}$ will participate.
The number of windows in which $p_i$ will be an edge object is
$p_i.edge\_lifespan = Min[p_i.lifespan - p_i.core\_lifespan,
Max_{1 \le j \le k}(p_{nb_j}.core\_lifespan)]$.

Basically, an object will be a core object in all the windows
in which it has at least $\theta^c$ neighbors. It will be an edge
object when its core object career ends (it no longer has enough
neighbors) but at least one of its neighbors is still a core object.
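
The career lifespans of Observation 5.4 can be computed directly from the neighbors' lifespans. The sketch below is illustrative only: the function names are hypothetical, and it assumes the count threshold is passed as `theta_c`. The number of windows in which at least `theta_c` neighbors participate equals the `theta_c`-th largest neighbor lifespan, and the edge lifespan transcribes the formula exactly as stated in the observation.

```python
def core_lifespan(own_lifespan, neighbor_lifespans, theta_c):
    """Windows in which the object has at least theta_c surviving neighbors."""
    if len(neighbor_lifespans) < theta_c:
        return 0  # never enough neighbors to be a core object
    # The theta_c-th largest lifespan is the last window count in which
    # theta_c neighbors are still alive.
    kth_largest = sorted(neighbor_lifespans, reverse=True)[theta_c - 1]
    return min(own_lifespan, kth_largest)

def edge_lifespan(own_lifespan, own_core_lifespan, neighbor_core_lifespans):
    """Windows spent as an edge object, following Observation 5.4 as stated."""
    longest_core_neighbor = max(neighbor_core_lifespans, default=0)
    return min(own_lifespan - own_core_lifespan, longest_core_neighbor)

# Example: an object living 5 windows with six neighbors and theta_c = 4.
core = core_lifespan(5, [1, 2, 3, 4, 5, 5], theta_c=4)
edge = edge_lifespan(5, core, [4, 1])
```

Here the object is a core object for 3 windows and, by the stated formula, an edge object for the remaining 2.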

Lifespan at Grid Cell Level. To tackle skeletal grid
cell maintenance, we now extend the concept of lifespan
from the object level to the grid cell level. In particular, we
analyze how the lifespans of objects, their neighborships and
their careers affect the lifespans of skeletal grid cells' status
and connections. For each skeletal grid cell $Sc_i$, we maintain
one lifespan indicator for $Sc_i.status$ and one for each
$Sc_i.connections[i]$. Each lifespan indicates in how many
future windows the value of this attribute will persist, based
on the objects in the current window. These indicators
are updated as new objects arrive.

Lemma 5.1. Status Lifespan. Given a skeletal grid
cell $Sc_i$ and all the objects $p_0$ to $p_n$ in $Sc_i$, the number
of windows in which $Sc_i$ will be a core cell is
$Sc_i.core\_lifespan = Max_{0 \le j \le n}(p_j.core\_lifespan)$.



Lemma 5.1 can be deduced from the definition of a core cell
(Def. 4.2). Namely, $Sc_i$ is a core cell if it contains at least
one core object.

Lemma 5.2. Connection Lifespan. Given two skeletal
grid cells $Sc_i$ and $Sc_j$, all objects in $Sc_i$, $p_0^{Sc_i}$ to
$p_n^{Sc_i}$, and all objects in $Sc_j$, $p_0^{Sc_j}$ to $p_m^{Sc_j}$, the
number of windows in which $Sc_i$ and $Sc_j$ will be connected
is defined as
$Connection(Sc_i, Sc_j).lifespan = Max[Min(p_a^{Sc_i}.core\_lifespan,
p_b^{Sc_j}.core\_lifespan, Neighbor(p_a^{Sc_i}, p_b^{Sc_j}).lifespan)]$,
$\forall a \in [0, n], b \in [0, m]$.

This indicates that two skeletal grid cells remain connected 
if at least one pair of core objects, each from one skeletal grid 
cell, are neighbors to each other. 
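
Lemmas 5.1 and 5.2 lift the object-level lifespans to the cell level with a max and a max-of-min. The following sketch illustrates this under simplifying assumptions: objects are small records carrying precomputed lifespans, and neighborships are given as a set of unordered name pairs. The `Obj` type and both function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Obj:
    name: str
    lifespan: int       # windows the object survives
    core_lifespan: int  # windows the object is a core object

def cell_core_lifespan(objects):
    """Lemma 5.1: a cell stays core as long as its longest-lived core object."""
    return max((o.core_lifespan for o in objects), default=0)

def connection_lifespan(objects_a, objects_b, neighbor_pairs):
    """Lemma 5.2: best pair of neighboring objects, one from each cell."""
    best = 0
    for a in objects_a:
        for b in objects_b:
            if frozenset((a.name, b.name)) not in neighbor_pairs:
                continue
            neighborship = min(a.lifespan, b.lifespan)
            best = max(best, min(a.core_lifespan, b.core_lifespan, neighborship))
    return best

cell_a = [Obj("a1", 4, 3), Obj("a2", 2, 0)]
cell_b = [Obj("b1", 5, 2), Obj("b2", 3, 3)]
pairs = {frozenset(("a1", "b1")), frozenset(("a2", "b2"))}
status_windows = cell_core_lifespan(cell_a)
connected_windows = connection_lifespan(cell_a, cell_b, pairs)
```

In the example, only the pair (a1, b1) consists of two core objects, so the connection lasts for the 2 windows in which both are still core.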

Auxiliary Meta-Data. To ensure that we only run one
range query search (rqs) for each new object and never
re-run rqs for existing objects, we maintain auxiliary
meta-information for each object in the window. In particular,
we maintain a "non-core-career neighbor list" for each object
$p_i$ to store all of $p_i$'s neighbors in its "non-core career". For
example, $p_i$ may currently have 100 neighbors. Based on the
lifespan analysis, it will be a core object for 3 windows and
then, due to most of its neighbors expiring, it will become
an edge object for 2 windows before expiration. In this case,
the "non-core-career neighbor list" of $p_i$ only contains its
neighbors in the last 2 windows of its lifespan, say 5 objects.

The "non-core-career neighbors" of each object are maintained
in a dynamic hash table. The hash table of each
object $p_i$ is initialized to have n buckets, with n the number
of windows that $p_i$ can survive. The hash key of the table is
the number of windows that a neighbor object can survive.
For example, when a data point $p_i$ finds a "non-core-career
neighbor" $p_j$, $p_j$ will be added to the $k$-th bucket of the hash
table, with k the number of windows $p_j$ can still survive (if
k is larger than the number of buckets remaining on $p_i$, $p_j$ is
put in the last bucket). At each window slide, we can simply
remove the whole first bucket of each remaining object, as
all the neighbors in this bucket must have expired after the
window slide.
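
The bucketed neighbor list described above can be sketched as follows. This is a minimal illustration, not the original implementation: it uses a deque of lists in place of a hash table, and the class and method names are hypothetical.

```python
from collections import deque

class NonCoreNeighborList:
    """Neighbors bucketed by the number of windows they can still survive."""

    def __init__(self, owner_lifespan):
        # One bucket per window that the owning object can survive.
        self.buckets = deque([] for _ in range(owner_lifespan))

    def add(self, neighbor_id, neighbor_lifespan):
        if not self.buckets:
            return  # the owner has already expired
        # Neighbors outliving the owner all go into the last bucket.
        k = min(neighbor_lifespan, len(self.buckets))
        self.buckets[k - 1].append(neighbor_id)

    def slide(self):
        """Drop the first bucket: its neighbors expire with this slide."""
        return self.buckets.popleft() if self.buckets else []

    def neighbors(self):
        return [n for bucket in self.buckets for n in bucket]

nlist = NonCoreNeighborList(owner_lifespan=3)
nlist.add("a", 1)
nlist.add("b", 2)
nlist.add("c", 7)  # outlives the owner, stored in the last bucket
expired = nlist.slide()
remaining = nlist.neighbors()
```

Dropping the first bucket costs one operation per object and slide, regardless of how many neighbors expire.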

The number of neighbors in such a "non-core-career neighbor
list" is bounded by the constant $\theta^c$. Namely, an object
can never have more than $\theta^c$ neighbors in its non-core
career, otherwise it would instead be a core object in those
windows. This theoretical bound guarantees the "lightness"
of this auxiliary meta-data. Also, it provides all necessary
access to the objects' neighbors needed in our cluster
extraction process. It thus guarantees that we only run the
minimum number of range query searches (one for each new
object) during the clustering.

5.4 C-SGS Algorithm 

We call our proposed algorithm based on the maintenance 
of skeletal grid cells and lifespan analysis C-SGS. 

Initialization. For a continuous clustering query, at the
initialization stage, C-SGS builds a grid-based index whose
grid cell size is equal to the size of the finest skeletal grid size
for this query (see Section 4). We assign to each grid cell
in this index the same attributes as the skeletal grid cells,
while we set their status to be noise, density to be "0", and
connections to be all "false" initially.

Handling Insertions. For each new object $p_{new}$
inserted into the window, C-SGS first loads it into its
corresponding skeletal grid cell based on its position in the data
space. Then, we run a range query search for $p_{new}$ to
identify $p_{new}$'s neighbors. Based on the lifespans of $p_{new}$ and
its neighbors (Observation 5.2), we can determine the lifespans
of the neighborships among them (Observation 5.3), as well as
the lifespans of the different stages of $p_{new}$'s "career"
(Observation 5.4). Using this information, we can now update
the status and connections of the skeletal grid cells in which
$p_{new}$ and its neighbors reside.

For status of skeletal grid cells, the insertion of a new 
object may only cause two types of changes. Namely, it 
may "promote" the skeletal grid cells to become core cells 
or "prolong" their core cell lifespans. 

Status promotion: A new object $p_{new}$ may promote the
skeletal grid cell $Sc_i$ that it resides in to become a core cell,
if it becomes the first core object in $Sc_i$. In this case, we
set the status of $Sc_i$ to core cell and set its core cell lifespan
equal to the core object lifespan of $p_{new}$. An example of this
case is shown in Case 1 of status promotion in Figure 6.

$p_{new}$ may also cause a status change of a skeletal grid
cell by upgrading its non-core-object neighbors, which reside
in these affected skeletal grid cells, to core objects. In
this case, for each upgraded neighbor $p_{upg}$ of $p_{new}$, we first
determine the lifespan of $p_{upg}$'s career by analyzing itself
and its neighbors. As every $p_{upg}$ was a non-core object, the
"non-core-career neighbor list" helps us to quickly access
all its neighbors without running a range query search again.
We then update the status of the skeletal grid cell in which
$p_{upg}$ resides to core cell and set its core cell lifespan equal
to the core object lifespan of $p_{upg}$. Correspondingly, the
"non-core-career neighbor list" of each $p_{upg}$ also needs to be
updated to exclude those objects that will only be neighbors
of $p_{upg}$ in its core object career. An example of this case is
shown in Case 2 of status promotion in Figure 6.

Status prolong: A new object $p_{new}$ may prolong the core
cell lifespan of the skeletal grid cell $Sc_i$ in which it resides, if
$p_{new}$'s core object lifespan is longer than that of any existing
object in $Sc_i$. In this case, we set $Sc_i$'s core cell lifespan equal
to the core object lifespan of $p_{new}$. An example of this case
is shown in Case 1 of status prolong in Figure 6.

$p_{new}$ may also prolong the core cell lifespans of the skeletal
grid cells by extending the core object lifespans of $p_{new}$'s
neighbors. For each neighbor $p_{core}$ of $p_{new}$ whose core object
lifespan is extended because of $p_{new}$'s arrival, we first
determine by how much its core object lifespan is extended, by
analyzing in how many more windows it will have at least
$\theta^c$ neighbors after $p_{new}$ joins its neighborhood. Then, we
update the core cell lifespan of the skeletal grid cell in which
each $p_{core}$ resides to the core object lifespan of the
corresponding $p_{core}$, if the latter is longer. An example of this
case is shown in Case 2 of status prolong in Figure 6.

For connections of skeletal grid cells, the insertion of a
new object may only cause two types of changes. Namely, it
may build new connections between skeletal grid cells or
prolong the lifespan of existing connections. The maintenance
process of the connections follows the same principles used
in the status maintenance logic (details omitted here for space
reasons but can be found in [18]).

Figure 6: Examples of updating cell status. $\theta^c = 4$,
grey circle = edge point, black circle = core point, number
on each object = number of windows the object can survive.

Handling Expirations. By using the lifespan analysis
technique introduced above, the impact on the skeletal grid
cells that could be caused by expiring objects has been
pre-handled when the objects arrive. Therefore, no maintenance
effort is needed for handling cluster structure changes when
individual objects expire. After the window slides, the only
update needed for the attributes of skeletal grid cells is to
check whether the new window is out of their lifespans. If
the new window is out of a cell's core cell lifespan, its status
needs to be set back to edge cell. If the new window is out
of the lifespan of any of its connections, the corresponding
connection needs to be set back to "false".
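
The interplay between insertion-time updates and the expiration check can be sketched as below. This is a simplified illustration under one assumption that differs in form from the text: instead of a countdown, each lifespan is stored as the index of the last window in which the attribute holds. The `Cell` class and its methods are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    status: str = "edge"
    core_until: int = -1  # last window index in which the cell is a core cell
    conn_until: dict = field(default_factory=dict)  # other cell id -> last window

    def on_core_object(self, window, core_lifespan):
        """Promotion or prolonging: keep the longest core lifespan seen so far."""
        if core_lifespan <= 0:
            return
        self.core_until = max(self.core_until, window + core_lifespan - 1)
        self.status = "core"

    def on_slide(self, new_window):
        """Expiration needs only a comparison against the stored lifespans."""
        if self.status == "core" and new_window > self.core_until:
            self.status = "edge"
        expired = [c for c, until in self.conn_until.items() if new_window > until]
        for other in expired:
            del self.conn_until[other]

cell = Cell()
cell.on_core_object(window=10, core_lifespan=3)  # core in windows 10, 11, 12
cell.on_core_object(window=10, core_lifespan=2)  # shorter, changes nothing
cell.conn_until["neighbor"] = 11                 # connected in windows 10, 11
cell.on_slide(12)
status_at_12, conns_at_12 = cell.status, dict(cell.conn_until)
cell.on_slide(13)
status_at_13 = cell.status
```

No object-level neighborship is consulted at slide time, which is the point of pre-handling expirations at insertion.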

Output Stage. At the output stage, the updated skeletal
grid cells can be viewed as the vertices V in a graph
G, and the connections among them can be viewed as the
edges E among the vertices. Therefore, we simply conduct
a depth first search on all the core cells to collect different
groups of connected core cells and the edge cells attached
to them. Each connected group of skeletal grid cells
constitutes the SGS summarization of a cluster $C_i$, $C_i.SGS$.
Given $C_i.SGS$, the full representation of $C_i$ can be easily
determined by collecting all objects covered by core cells in
$C_i.SGS$ and those covered by the edge cells in $C_i.SGS$ and
connected to at least one core object in $C_i.SGS$'s core cells.
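
The output stage amounts to a graph traversal over the cell graph. The sketch below shows one way to do it, assuming the cell statuses and connections are available as plain dictionaries; the function name and data layout are hypothetical. The traversal expands only through core cells and attaches edge cells without expanding them.

```python
def extract_sgs(status, connections):
    """Group connected core cells, plus attached edge cells, into SGS sets.

    status: dict cell_id -> "core" or "edge"
    connections: dict cell_id -> set of connected cell ids
    """
    seen = set()
    clusters = []
    for start, cell_status in status.items():
        if cell_status != "core" or start in seen:
            continue
        group, stack = set(), [start]
        seen.add(start)
        while stack:  # iterative depth first search
            current = stack.pop()
            group.add(current)
            for other in connections.get(current, ()):
                if status.get(other) == "core":
                    if other not in seen:
                        seen.add(other)
                        stack.append(other)
                elif status.get(other) == "edge":
                    group.add(other)  # attached, but not expanded further
        clusters.append(group)
    return clusters

status = {"a": "core", "b": "core", "c": "edge", "d": "core", "e": "edge"}
connections = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set(), "e": set()}
clusters = extract_sgs(status, connections)
```

The edge cell "e" is connected to no core cell and therefore belongs to no cluster.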

6. PATTERN ARCHIVER 

The pattern archiver handles two major tasks, namely 
pattern compression and selective pattern archival. 

6.1 Multi-Resolution Cluster Summarization 

Our proposed cluster summarization SGS supports mul- 
tiple resolutions. In general, the SGS in different levels of 
resolution follow the same design as presented in Section 
4. An SGS of any resolution is composed of a sequence of 
skeletal grid cells, and each skeletal grid cell has the same 5 
attributes introduced before. 

For any cluster $C_x$, the SGS of $C_x$ formed by the Pattern
Extractor is based on the finest granularity, namely the
smallest skeletal grid cells. Thus it is of the finest resolution.
We call such an SGS the "Basic SGS" of $C_x$. The SGS
in coarser resolutions are built by hierarchically combining
the Basic SGS. For a cluster $C_x$, we say that the
Basic SGS of $C_x$ is at Level 0 of the resolution hierarchy,
noted as $C_x.SGS^{0}$. Any SGS in a coarser resolution is at
a Level n, denoted as $C_x.SGS^{n}$.

Each skeletal grid cell in $C_x.SGS^{n}$ ($n > 0$), $C_x.Sc_i^{n}$, is
formed by combining the skeletal grid cells within a certain
($\theta$) sized hypercube space in $C_x.SGS^{n-1}$. For example, a
2-dimensional cluster $C_x$ has SGS in two resolutions (Figure
5). They are at Levels 0 and 1. If the compression rate
$\theta = 3$, each skeletal grid cell of the SGS at Level 1 is made
by combining $3^2$ adjacent skeletal grid cells at Level 0. Both
the number of resolutions allowed and the parameter $\theta$ are
part of the configuration of our system.

Such a compression process for building $C_x.SGS^{n}$ can be
finished with a single scan of the skeletal grid cells in
$C_x.SGS^{n-1}$. Given $C_x.SGS^{n-1}$, to build $C_x.SGS^{n}$
we first generate a set of skeletal grid cells for $C_x.SGS^{n}$ to
cover the whole data space occupied by the corresponding cells
in $C_x.SGS^{n-1}$. Then we set the five attributes for each
$Sc_i^{n}$. The side length of any $C_x.Sc_i^{n}$ is simply equal
to the side length of a skeletal grid cell at Level n-1 times
$\theta$. Any $C_x.Sc_i^{n}$ is a core cell if at least one $C_x.Sc_j^{n-1}$
covered by it is a core cell. Otherwise, it is an edge cell.
The population of any $C_x.Sc_i^{n}$ is equal to the sum of the
populations of the $C_x.Sc_j^{n-1}$s covered by it. The connection
vector of a $C_x.Sc_i^{n}$ is decided by the connections between
the "boundary" $C_x.Sc_j^{n-1}$s covered by it and the Level n-1
cells covered by its adjacent cells.
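
The single-scan compression can be illustrated for the status and population attributes. The sketch below assumes cells are keyed by integer grid coordinates, so that the parent of a cell is found by integer division by the compression rate; connections are omitted for brevity, and the function name is hypothetical.

```python
from collections import defaultdict

def compress(cells, theta):
    """Build the next coarser level from a dict of finer-level cells.

    cells: dict mapping integer grid coordinates (tuple) to (status, population)
    theta: compression rate, i.e. number of finer cells per side of a coarser cell
    """
    coarse = defaultdict(lambda: ["edge", 0])
    for coord, (status, population) in cells.items():
        parent = tuple(c // theta for c in coord)
        entry = coarse[parent]
        if status == "core":
            entry[0] = "core"    # core if at least one covered cell is core
        entry[1] += population   # populations of covered cells add up
    return {coord: tuple(entry) for coord, entry in coarse.items()}

level0 = {(0, 0): ("edge", 2), (1, 2): ("core", 5), (3, 0): ("edge", 1)}
level1 = compress(level0, theta=3)
```

Each finer cell is visited exactly once, which matches the single-scan property stated above.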

Budget- and Accuracy- Aware Resolution Selection. 
Given the multiple resolution choices, the Pattern Archiver 
can decide in which resolution to archive the patterns based 
on both the system-resource budget and the accuracy re- 
quired by the specific analytical tasks. In our SGS design, 
for a cluster summarization at a certain resolution, both its 
space consumption and conciseness are deterministic and
easily calculable. For space consumption, given the basic
SGS of a cluster extracted, we can easily determine the
number of skeletal grid cells needed in any other resolution 
for the same cluster, by calculating how many grid cells 
at that resolution are needed to cover the same data space. 
Since the SGS at different resolutions have the same design, 
the amount of information carried by each skeletal grid cell 
in any resolution is fixed. Thus, one can easily determine 
how much storage space is needed exactly for a given cluster 
in any resolution. For accuracy, as the sizes of the skeletal
grid cells at all resolutions are known, the analysts
know exactly the granularity that their analytical task will
be working on for a certain resolution.

6.2 Selective Pattern Archiving 

The Pattern Archiver also selectively picks which clusters
to archive. Currently, our system supports several simple
but useful cluster selection mechanisms, including using
sampling techniques to select certain numbers of clusters to 
archive in a period of time and using feature selection to 
only archive clusters with certain features (e.g. only archive 
the clusters reaching a certain population or volume). More 
sophisticated pattern selection techniques, such as evolution 
driven techniques, will be studied in our future work. 

7. PATTERN STORAGE AND MATCHING 
7.1 Pattern Organization in Pattern Base 

Our proposed cluster summarization method SGS empow- 
ers us to easily organize the extracted clusters based on their 
features. In particular, we build two indices for the archived 
clusters. One is based on the position of each cluster, and 
the second is based on all other features of each cluster cap- 
tured in SGS. 

We call the first index the locational feature index. Since
clusters are multi-dimensional objects, we express the position
of each cluster using its minimum bounding rectangle (MBR).
In our system, we employ one of the most widely used indices
for MBRs, namely the R-tree index, to organize them. The
second index, called the non-locational feature index,
organizes the clusters based on their non-locational features. We
use a four-dimensional grid index to organize the clusters'
SGS, with the four dimensions being the volume (number of
skeletal grid cells), the status count (number of core cells),
the average density and the average connectivity of each cluster.

7.2 Cluster Matching Process 

The Cluster Matching Queries (see Figure 3) are executed 
by the Pattern Analyzer. To execute such queries, we first 
provide a distance metric (between 0-1) to measure the dis- 
tance between two clusters. The metric is user-customizable 
based on application semantics. 

$Dist(C_a, C_b) = ps \cdot Dist_{location} + \sum_i w_i \cdot Dist_{nlf\_i}(C_a, C_b)$
$ps, Dist_{location} \in \{0, 1\}$; $\forall i$: $w_i, Dist_{nlf\_i} \in [0, 1]$, $\sum_i w_i = 1$

In this distance metric, $Dist_{location}$ indicates whether
two clusters overlap (0) or not (1). $ps$ indicates whether
the matching is "position-sensitive" (1) or not (0). $Dist_{nlf\_i}$
represents the distance of two clusters on a specific
non-locational feature and $w_i$ represents the analyst-specified
weight on this feature.
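
A direct transcription of this metric is shown below as a minimal sketch. It assumes that the locational distance is 1 for non-overlapping clusters and 0 otherwise, and that the per-feature distances have already been computed; the function and parameter names are hypothetical.

```python
def cluster_distance(position_sensitive, clusters_overlap,
                     feature_distances, weights):
    """Weighted distance: ps * Dist_location + sum_i w_i * Dist_nlf_i."""
    ps = 1 if position_sensitive else 0
    dist_location = 0 if clusters_overlap else 1
    non_locational = sum(weights[f] * feature_distances[f] for f in weights)
    return ps * dist_location + non_locational

weights = {"volume": 0.4, "density": 0.6}
distances = {"volume": 0.5, "density": 0.0}
d_free = cluster_distance(False, False, distances, weights)
d_apart = cluster_distance(True, False, distances, weights)
d_overlap = cluster_distance(True, True, distances, weights)
```

In the position-sensitive case, two non-overlapping clusters already incur the locational penalty of 1, so the non-locational features need not be compared at all.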

To use this distance metric, the analyst needs to first 
specify whether the matching required by her application 
is position-sensitive, namely whether the matched clusters 
have to overlap in the data space. For the position-sensitive 
applications, $ps = 1$. If two clusters do not overlap,
$Dist_{location}(C_a, C_b) = 1$, the largest possible distance
between two clusters, indicating that the two clusters are not
similar and no further comparison on other features will be 
needed. For the non-position-sensitive applications, since 
ps = 0, the locational distance between two clusters is con- 
sidered to be 0. 

The second part of the distance metric measures the dis- 
tance between two clusters on the four non-locational fea- 
tures, namely volume, status, population and connectivity. 
The distances on these features are used in both the match
candidate search and the detailed cell-level cluster match.

Candidate Search. Given a to-be-matched cluster, a
customized distance metric and a distance threshold specified
by the analyst, our system first searches for potential
match candidates in the Pattern Base. In the position-sensitive
case, the Pattern Analyzer first searches the locational
feature index for candidate clusters. If any overlapping
clusters are found, it calculates their non-locational
distances to the to-be-matched cluster, and returns the
similar clusters whose distances are smaller than the
threshold. In the non-position-sensitive case, the Pattern
Analyzer directly searches the non-locational feature
index for candidates. Given the distance metric and the
distance threshold, the Pattern Analyzer can determine the
range of the search on each dimension (feature). For example,
if the volume of the to-be-matched cluster is 20, the weight
on the volume distance is 0.4, and the overall distance
threshold is 0.2, the volume of the candidate clusters has
to be between 14 and 30. This is because any other number
x < 14 || x > 30 will make abs(x - 20) / min(x, 20) > (0.2 / 0.4),
which definitely does not fulfill the search criteria. The same
principle can be used on other features to determine the
range of search. Given the search ranges on all dimensions,
the Pattern Analyzer can quickly narrow down the candidate
clusters to a small subset by searching the feature index.
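
The range computation in the volume example can be reproduced with a short calculation. The sketch below assumes, as the example's inequality suggests, that the per-feature distance is abs(x - q) / min(x, q) for a query value q, and that the feature takes integer values; it uses the threshold 0.2 and the weight 0.4 that appear in the ratio 0.2 / 0.4. The function name is hypothetical.

```python
import math

def feature_search_range(query_value, weight, threshold):
    """Integer range of x with abs(x - q) / min(x, q) <= threshold / weight."""
    ratio = threshold / weight
    # For x < q: (q - x) / x <= ratio  <=>  x >= q / (1 + ratio)
    low = math.ceil(query_value / (1 + ratio))
    # For x >= q: (x - q) / q <= ratio  <=>  x <= q * (1 + ratio)
    high = math.floor(query_value * (1 + ratio))
    return low, high

low, high = feature_search_range(query_value=20, weight=0.4, threshold=0.2)
```

Any candidate outside this range would exceed the overall threshold on this feature alone, so it can be pruned without looking at the other features.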

Grid Cell Level Cluster Match. Given a to-be-matched
cluster and a match candidate cluster for it, grid
cell level cluster match compares the features of two clusters
in their corresponding sub-regions (skeletal grid cells).
In particular, grid cell level match uses the same customizable
distance metric introduced earlier, while the distance
between two clusters is now measured by aggregating the
difference between all the corresponding skeletal grid cell pairs
in these two clusters. More precisely, given a certain alignment
(a location shifting vector) between two clusters $C_a$ and
$C_b$, each skeletal grid cell $Sc_i$ in $C_a$ may have a
corresponding skeletal grid cell $Sc_j$ in $C_b$, depending on
whether its corresponding sub-region is also covered by $C_b$.
If $Sc_i$ has a corresponding skeletal grid cell $Sc_j$ in $C_b$,
their difference can be measured by comparing their status,
density and connectivity features. Otherwise, $Sc_i$ is assigned
the maximum difference with its corresponding sub-region,
which is not a part of $C_b$ and thus can be viewed as an empty
grid cell. When calculating the distance between two clusters
$C_a$ and $C_b$, we sum the difference between each $Sc_i$ in $C_a$
and its corresponding sub-region in $C_b$
to form the overall distance between the two clusters.
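
The aggregation over corresponding cells can be sketched as follows. This is an illustrative simplification: each cluster is a dictionary from integer cell locations to a single feature value, the per-cell difference function is supplied by the caller, and a missing counterpart is charged a maximum difference of 1.0. All names are hypothetical.

```python
def cell_level_distance(cells_a, cells_b, alignment, cell_diff, max_diff=1.0):
    """Sum per-cell differences between C_a and C_b under a shifting vector."""
    total = 0.0
    for location, features in cells_a.items():
        shifted = tuple(x + s for x, s in zip(location, alignment))
        counterpart = cells_b.get(shifted)
        if counterpart is None:
            total += max_diff  # sub-region not covered by C_b: empty cell
        else:
            total += cell_diff(features, counterpart)
    return total

cells_a = {(0, 0): 0.5, (1, 0): 0.25}
cells_b = {(1, 2): 0.5}
density_diff = lambda a, b: abs(a - b)
shifted = cell_level_distance(cells_a, cells_b, (1, 2), density_diff)
unshifted = cell_level_distance(cells_a, cells_b, (0, 0), density_diff)
```

With the alignment (1, 2), one cell of the first cluster finds an identical counterpart and the other finds none, giving a total of 1.0; without the shift, both cells are unmatched.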

In the position-sensitive cases, no alignment is needed,
or in other words, the alignment vector is always equal to
[0,0,...,0]. This is because such applications require any
skeletal grid cell $Sc_i$ in $C_a$ to be matched with the skeletal
grid cell $Sc_j$ in $C_b$ that has the same absolute position
in the data space. Therefore in such cases, we only need a
single scan on the skeletal grid cells in the two clusters to
calculate the distance between them.

In the non-position-sensitive case, one or more alignments 
that minimize the distance between two clusters may exist. 
When given sufficient computation time, such as in an of- 
fline computation, one could apply an exhaustive search to 
find such an optimal alignment. In our system, for online 
computation, we use an A* style anytime search algorithm 
to search for the best alignment within a certain compu- 
tation time budget. In particular, we start with an align- 
ment that makes two clusters well overlapped. Then we 
continuously search along the direction of the most promis- 
ing "nearby" alignment, which gives the smallest distance so 
far. When the given computation time budget is reached,
we stop searching and return the smallest distance found so 
far as the distance between the two clusters. 
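
The anytime behaviour can be illustrated with a simple best-first search over shifting vectors. This sketch is not the A* style algorithm used in the system; it is a plain best-first exploration under a time budget, with a hypothetical bound `max_shift` on each component of the alignment so that the search space is finite.

```python
import heapq
import time

def anytime_alignment_search(distance, start, max_shift, budget_seconds):
    """Explore nearby alignments, most promising first, until time runs out."""
    best_align, best_dist = start, distance(start)
    frontier = [(best_dist, start)]
    visited = {start}
    deadline = time.monotonic() + budget_seconds
    while frontier and time.monotonic() < deadline:
        _, align = heapq.heappop(frontier)  # smallest distance so far
        for dim in range(len(align)):
            for step in (-1, 1):
                nxt = align[:dim] + (align[dim] + step,) + align[dim + 1:]
                if nxt in visited or abs(nxt[dim]) > max_shift:
                    continue
                visited.add(nxt)
                d = distance(nxt)
                if d < best_dist:
                    best_align, best_dist = nxt, d
                heapq.heappush(frontier, (d, nxt))
    return best_align, best_dist

# Toy distance with its minimum at the alignment (2, -1).
target = (2, -1)
toy_distance = lambda a: sum(abs(x - t) for x, t in zip(a, target))
best_align, best_dist = anytime_alignment_search(toy_distance, (0, 0), 3, 5.0)
```

The best alignment found so far is always available, so stopping at the deadline returns a usable, if possibly suboptimal, distance.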

8. EXPERIMENTAL EVALUATION 

We conducted our experiments on a Dell desktop with
an Intel Core2 2.2GHz processor and 3GB memory, running
Windows 7 Professional. We implemented the algorithms in
VC++ 7.0.

Real Datasets. We used two real streaming datasets in
our experiments. The first dataset, GMTI (Ground Moving
Target Indicator) [6], records the real-time information on
moving objects gathered by 24 different ground stations or
aircraft in 6 hours from JointSTARS. It has around 100,000
records regarding the information on vehicles and helicopters
(speed ranging from 0-200 mph) moving in a certain
geographic region. The second real dataset we use is the Stock
Trading Traces data (STT) from [11], which has one million
transaction records throughout the trading hours of a day.

Footnote (Section 7.2, alignment): An alignment for two
Skeletal Grid Summarizations (SGS) is a location shifting
vector. For example, given two three-dimensional clusters
$C_a$ and $C_b$, an alignment equal to [1,2,1] indicates that any
skeletal grid cell in $C_a$ with location vector equal to [x,y,z]
corresponds to a skeletal grid cell in $C_b$ with location vector
equal to [x+1,y+2,z+1], if any.

For the experiments that involve data sets larger than the
sizes of these two datasets, we append multiple rounds of
the original data, varied by applying random differences to all
attributes, until the data reaches the desired size.

Alternative Summarization Formats. We compare
our proposed Skeletal Grid Summarization (SGS) with three
alternative cluster summarization formats. 1) The traditional
"Centroid-Radius-Density" summarization (CRD). 2)
Random Sampling Summarization (RSP). The RSP for each
cluster is generated by sampling the cluster members at a
certain sampling rate R. To compare RSP with our proposed
SGS summarization, for each specific cluster in the
experiment, R is always controlled so that its RSP has the same
memory consumption as the SGS for the same cluster. 3)
Skeletal Point Set (SkPS) summarization, our initial cluster
summarization design proposed in Section 4.2.

8.1 Efficiency of Cluster Extraction + Summarization

In this experiment, we evaluate how many system resources
are needed to generate each of the alternative cluster
summarizations. Since our proposed solution, C-SGS,
incorporates cluster extraction and summarization into
a single process, we compare its performance with the
following alternatives. 1) Extra-N: Extract clusters using the
state-of-the-art algorithm Extra-N [16] but do not generate any
cluster summarization. 2) Extra-N + CRD: Extract clusters
using Extra-N and then generate CRD for each extracted
cluster. 3) Extra-N + RSP: Extract clusters using Extra-N
and then generate RSP for each extracted cluster. 4) Extra-N
+ SkPS: Extract clusters using the Extra-N algorithm and
then generate (approximated) SkPS for each cluster using
the MG algorithm proposed in [9].

We first run each alternative method against the STT
stream to extract clusters based on four dimensions, namely
the transaction type (buy/sell), price, volume and time. To
compare the performance of the alternatives when handling
clusters with different characteristics, we use three different
query parameter settings, namely case 1: ($\theta^r = 0.05$,
$\theta^c = 10$), case 2: ($\theta^r = 0.1$, $\theta^c = 8$), case 3:
($\theta^r = 0.2$, $\theta^c = 5$).
Also, for each case, we use three different window parameter
settings, namely we fix the window size (win) for all three
settings at 10K tuples, while varying the slide size (slide) to
be equal to 0.1K, 1K and 5K tuples respectively.

For each case, we first verify the correctness of our 
proposed C-SGS cluster extraction method by comparing 
the clusters extracted by it in full representation with those 
extracted by the state-of-the-art technique Extra-N. In all
the test cases, we found that the clusters extracted by C-SGS
are identical to those extracted by Extra-N.

Footnote: All clustering algorithms following the definition
in [8] should produce the same clustering results given the
same input object sequence.

For efficiency, we measure two major performance metrics
for stream processing: 1) The average response time for
each window (denoted as Response Time). For each window,
we measure the CPU time elapsed from the
time that all new data have arrived to the time that all
clusters have been output in both the full and summarized
representation. The average response time for each window
shown in all cases is averaged over a run of 10K windows.
2) The memory footprint, namely the peak memory
utilization of each alternative among the 10K windows.

As shown in Figure 7, compared to Extra-N, which
extracts clusters only but does not generate any cluster
summarization (the baseline case), the other four alternatives,
each generating a specific type of cluster summarization,
exhibit some overhead in terms of CPU time utilization.
However, the overhead caused by C-SGS, Extra-N + CRD,
and Extra-N + RSP is very modest, if not negligible. The
overhead of Extra-N + CRD and Extra-N + RSP is modest
because CRD and RSP are very simple summarization formats
that can be generated by at most two scans of the members
of each cluster. The overhead caused by our proposed solution
C-SGS is comparable with that of those two simple
summarization methods. This is because the major computation
needed for generating the SGS cluster summarization, namely
determining the status and connections among skeletal grid
cells, is piggybacked on the cluster extraction process
itself. The CPU overhead of Extra-N + SkPS is significantly
higher than that of the other alternatives, because
generating SkPS is computationally very expensive [9].
Across window parameter settings, C-SGS has lower overhead
for the settings with larger win/slide ratios.
This is because the performance of Extra-N is affected by
the increasing number of "views" that need to be maintained,
which is equal to win/slide (see [16] for details),
while the meta-data maintained by C-SGS and the corresponding
maintenance effort are independent of this ratio.
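As a worked example of the win/slide ratio for the three window settings used here, the following sketch computes the number of views Extra-N would maintain. The function name `num_views` is hypothetical, and the sketch assumes that slide divides win evenly, as it does in all three settings.

```python
def num_views(win, slide):
    """Number of views, taken to be win / slide as stated in the text."""
    if win % slide != 0:
        raise ValueError("this sketch assumes slide divides win evenly")
    return win // slide


WIN = 10_000  # window size of 10K tuples
views = {slide: num_views(WIN, slide) for slide in (100, 1_000, 5_000)}
print(views)  # {100: 100, 1000: 10, 5000: 2}
```

The smallest slide (0.1K) therefore implies fifty times as many views as the largest (5K), which is where the gap between Extra-N and C-SGS would be expected to be widest.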

Memory-wise, as shown in Figure 7, our proposed method
C-SGS also exhibits very limited overhead in all test
cases. This is because SGS is generated in place within
the cluster extraction process.

Similar performance is also observed when the same
experiments are run on the GMTI data. We have also conducted
an experiment showing the superiority of our proposed method
when using time-based windows and under fluctuating input
rates. The details of these experiments can
be found in our technical report [18].

In conclusion, using our proposed C-SGS solution, we can
efficiently generate the Skeletal Grid Summarization (SGS)
for extracted clusters during the online clustering process,
with very limited system resource overhead.

Figure 7: CPU time and Memory comparisons for
generating alternative summarizations.

8.2 Efficiency of Cluster Matching Queries

Next, we study the performance of running cluster
matching queries using our proposed summarization format
SGS and the alternative summarization formats. We run
three queries using the same pattern parameter settings as
in the previous experiment, but with a single window
parameter setting (win = 10K, slide = 1K), against the
STT data using our proposed C-SGS method. We vary the
size of the Pattern Base among 0.1K, 1K and 10K clusters.
In each test case, we run each clustering query and
archive all the clusters detected into the Pattern Base until
the required number of archived clusters is reached. For
each archived cluster, we also generate and keep the other
three alternative cluster summarization formats for
evaluating the other matching methods. Once the required number
of clusters is archived, we stop archiving and randomly pick
100 newly detected clusters as to-be-matched clusters. For
each to-be-matched cluster, we run four matching queries
against the archived clusters, each using one alternative
cluster summarization method and the corresponding
distance metric. In particular, we implement a subtraction
function to measure the distance between the CRDs of two
clusters, which gives equal weight to the three cluster
features captured in CRD, namely the centroid, range and density.
We use the subset matching algorithm presented in [15]
to calculate the distance between the RSPs of two clusters.
We use the graph edit distance algorithm presented in [13]
to calculate the distance between the SkPSs of two clusters.
We give equal weight to all four features when measuring
the distance between the SGSs of two clusters.
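To make the equal-weight idea concrete, the CRD distance might look roughly like the sketch below. Only the equal weighting of centroid, range and density comes from the text. The `CRD` class, the scalar `radius` encoding of the range, the relative-difference normalization and the `scale` parameter are all assumptions introduced for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class CRD:
    centroid: tuple   # one coordinate per dimension
    radius: float     # hypothetical scalar encoding of the cluster "range"
    density: float


def rel_diff(a, b):
    """Relative difference in [0, 1] between two non-negative scalars."""
    top = max(abs(a), abs(b))
    return 0.0 if top == 0 else abs(a - b) / top


def crd_distance(x, y, scale=1.0):
    """Equal-weight average of three per-feature differences, each in [0, 1]."""
    centroid_gap = min(1.0, math.dist(x.centroid, y.centroid) / scale)
    parts = (
        centroid_gap,
        rel_diff(x.radius, y.radius),
        rel_diff(x.density, y.density),
    )
    return sum(parts) / len(parts)   # every feature weighted 1/3


a = CRD(centroid=(0.0, 0.0), radius=1.0, density=10.0)
b = CRD(centroid=(3.0, 4.0), radius=2.0, density=5.0)
d_ab = crd_distance(a, b, scale=10.0)   # each of the three parts equals 0.5 here
```

Normalizing each feature before averaging keeps any one of them from dominating. Whether the original implementation normalizes in this way is not stated in the text.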

For each Pattern Base size, we measure the average response
time over all cluster matching queries and the memory
space consumed by storing the cluster summarizations.

As shown in Figure 8, when matching against 0.1K clusters,
the average response time for each cluster matching
query using SGS is less than 0.1 second. For the 1K and
10K cases, the average response time for our solution is only
around 0.5 seconds and 3 seconds respectively. Such efficiency
is comparable with cluster matching using CRD, which is very
fast because of its extremely simple matching mechanism
(simply three subtraction operations). This is due to the
design of SGS, which effectively summarizes the key features
of each cluster at both the cluster and grid levels. In
particular, with our proposed two-phase matching strategy, the
majority of the candidates in the Pattern Base are filtered
out in the summarization matching phase. Thus, the more
expensive grid-level matching is only needed for a very small
portion of the candidates. In our experiment, we found that
on average only 6% of the candidate clusters required
grid-level matching during the cluster matching process.
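The filter-then-refine pattern behind the two-phase strategy can be sketched as follows. This is an illustration under stated assumptions, not the matching algorithm of this paper: summaries are reduced to plain sets of grid-cell coordinates, and `size_gap` (cheap, cluster level) and `jaccard_distance` (more expensive, grid level) are hypothetical stand-ins for the actual SGS distance functions.

```python
def two_phase_match(query, pattern_base, coarse_dist, fine_dist, threshold, k=3):
    """Return up to k (distance, cluster_id) pairs, closest first."""
    # Phase 1: cheap cluster-level filter over the whole pattern base.
    survivors = [
        (cid, summary)
        for cid, summary in pattern_base.items()
        if coarse_dist(query, summary) <= threshold
    ]
    # Phase 2: expensive grid-level comparison, only for the survivors.
    ranked = sorted((fine_dist(query, summary), cid) for cid, summary in survivors)
    return ranked[:k]


def size_gap(a, b):
    """Hypothetical cluster-level distance: relative difference in cell count."""
    return abs(len(a) - len(b)) / max(len(a), len(b))


def jaccard_distance(a, b):
    """Hypothetical grid-level distance: 1 minus the overlap of occupied cells."""
    return 1 - len(a & b) / len(a | b)


query = {(0, 0), (0, 1), (1, 1)}
pattern_base = {
    "a": {(0, 0), (0, 1), (1, 1)},
    "b": {(0, 0), (0, 1), (5, 5)},
    "c": {(i, 0) for i in range(20)},   # far larger, rejected in phase 1
}
matches = two_phase_match(query, pattern_base, size_gap, jaccard_distance, threshold=0.5)
print(matches)  # [(0.0, 'a'), (0.5, 'b')]
```

In this toy run cluster "c" never reaches the grid-level comparison, which is the effect the 6% figure above describes at scale.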

Memory-wise, SGS consumes only 0.12M, 1.38M and
12.24M of memory space to store 0.1K, 1K and 10K clusters
respectively (Figure 8). In particular, each 4-dimensional
skeletal grid cell consumes only 23 bytes: position: 16 bytes
(4 integers), status: 1 byte (1 boolean), density: 4 bytes (1
integer), connection: 2 bytes (2^4 = 16 booleans). In all test
cases, the average number of skeletal grid cells in each
cluster is 68. Therefore, only about 1.5K of memory is needed to
store the SGS of each cluster on average. Compared with the
memory space needed for storing the full representation of the
clusters, which is 6.4M, 75.2M and 680.2M for 0.1K,
1K and 10K clusters respectively, the average compression
rate of SGS in our experiment is around 98%.
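The arithmetic in this paragraph can be checked with a few lines of code. The packed layout string below is a hypothetical encoding chosen to match the stated field sizes (4-byte integers, a 1-byte boolean, 16 connection bits). It is not claimed to be the layout used in the actual implementation.

```python
import struct

# Hypothetical packed layout of one 4-dimensional skeletal grid cell:
# 4 x int32 position, 1 x bool status, 1 x int32 density, 16 connection bits.
CELL_FORMAT = "<4i?iH"                         # "<" means no alignment padding
cell_bytes = struct.calcsize(CELL_FORMAT)      # 16 + 1 + 4 + 2
cells_per_cluster = 68                         # average reported in the text
sgs_bytes = cell_bytes * cells_per_cluster

# Reported totals (in M) for 0.1K, 1K and 10K archived clusters.
sgs_m = [0.12, 1.38, 12.24]
full_m = [6.4, 75.2, 680.2]
compression = [1 - s / f for s, f in zip(sgs_m, full_m)]

print(cell_bytes, sgs_bytes)   # 23 1564
```

The 1564 bytes per cluster is the "about 1.5K" quoted above, and all three compression rates come out just above 98%.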

In conclusion, our proposed solution demonstrates high
efficiency for cluster matching queries, which is significantly
better than matching SkPS or RSP. Its performance is
comparable with matching the simple CRD cluster summarization.
However, matching CRD is shown to have a much worse
cluster matching quality than our proposed
method of matching SGS (see Section 8.3).







We also conducted a series of experiments to confirm both 
the efficiency and effectiveness of cluster matching queries 
when using SGS with different resolutions. The details of 
those experiments can be found in our technical report [18]. 

9. CONCLUSION 

In this work, we present a framework to support summarization
and matching of density-based clusters in streaming
environments. First, we address the open problem of designing
a descriptive yet compact summarization method for
density-based clusters. Second, we present an efficient
computation strategy to quickly summarize the detected clusters
into SGS during online clustering. Lastly, we design a cluster
archiving and matching mechanism, which allows analysts to
submit cluster matching queries to find similar clusters
detected earlier in the stream history. Our experimental study
demonstrates the clear superiority of our proposed methods
in both efficiency and effectiveness.



Figure 8: CPU time and Memory comparison for 
cluster matching queries using alternative cluster 
summarization methods 



8.3 Quality of Cluster Matching 

To measure the quality of cluster matching using alternative
summarization formats, we invited 20 human analysts
(all WPI graduate students) to visually analyze the similarity
between each to-be-matched cluster and the matched
clusters found for it using each alternative method. The
analysis process is supported by ViStream [14], a freeware
multivariate data visualization tool, which has been shown
to be effective for helping human analysts to observe and
understand multi-dimensional clusters in streams. For each
to-be-matched cluster, the analysts are asked to rate the top
three similar clusters found by each summarization format
into three categories, namely "very similar", "similar", and
"not similar".




Figure 9: Similar rate given by users for matched 
clusters found by alternative summarizations 

As shown in Figure 9, our proposed summarization
method SGS demonstrates a high "similar rate", which is
significantly better than that of all the other alternatives.
This indicates that the human analysts agree with most of the
similar clusters found using SGS, while disagreeing with a large
percentage of those found using the other alternatives. This
shows the high effectiveness of SGS summarization in terms
of cluster matching. Due to the page limit, the detailed
experimental setup and result analysis of this experiment are
omitted here but can be found in our technical report [18].